Entropy is a measure of uncertainty. Distributions with high entropy are more uncertain because they spread probability across more possibilities.
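To make this concrete, Shannon entropy is \(H = -\sum_i p_i \log(p_i)\). Here is a minimal base-R sketch (the helper `shannon_entropy` is ours, not from any package), using a concentrated die versus a fair die:

```r
# Shannon entropy in nats: H = -sum(p * log(p)), dropping zero-probability bins
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# A concentrated ("low entropy") die vs a fair ("high entropy") die
p_low  <- c(.3, .25, .2, .1, .05, 0)
p_high <- rep(1/6, 6)

shannon_entropy(p_low)   # about 1.41
shannon_entropy(p_high)  # log(6), about 1.79 -- the maximum for six outcomes
```

The fair die attains the maximum possible entropy for six outcomes, \(\log(6)\); concentrating probability on a few faces always lowers it.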
```r
# Create example data
set.seed(123)

# Low entropy data (concentrated around a single value)
low_entropy <- data.frame(
  value = sample(x = c(1:6), 10000, replace = TRUE,
                 prob = c(.3, .25, .2, .1, .05, 0)),
  type = "Low Entropy"
)

# High entropy data (more spread out, more uniform)
high_entropy <- data.frame(
  value = sample(x = c(1:6), 10000, replace = TRUE),
  type = "High Entropy"
)

# Combine the data
entropy_data <- rbind(low_entropy, high_entropy)

# Create the plots
ggplot(entropy_data, aes(x = value)) +
  geom_histogram(fill = "#1c5253", alpha = 0.5, binwidth = 1, color = "white") +
  facet_wrap(~type) +
  labs(title = "Comparing High and Low Entropy Distributions",
       x = "Value",
       y = "Count") +
  theme_cowplot() +
  theme(strip.background = element_rect(fill = "#1c5253"),
        strip.text = element_text(color = "white", size = 12))
```
When you choose a prior or a likelihood function for your data (the distribution representing the behavior of your outcome), the best choice is the distribution that maximizes entropy while still meeting the constraints of your variable.
maximum entropy
First, the distribution with the biggest entropy is the widest and least informative distribution. Choosing the distribution with the largest entropy means spreading probability as evenly as possible, while still remaining consistent with anything we think we know about a process.
For priors: it means choosing the least informative distribution consistent with any partial scientific knowledge we have about a parameter.
For likelihoods: it means selecting the distribution we’d get by counting up all the ways outcomes could arise, consistent with the constraints on the outcome variable.
Second, nature tends to produce empirical distributions that have high entropy.
Third, it works.
what are your options?
So many! But here are a few:
binomial
outcome of each trial is binary (yes/no, happened/didn’t happen)
fixed number of trials
A Bernoulli model is a special case of the binomial with only 1 trial
probability of outcome is the same across trials
trials are independent
In general, this distribution is your best bet for binary outcomes.
```r
# Create data for different N and p parameters
x <- seq(0, 100, by = 1)
N <- c(1, 5, 100)
p <- c(.25, .40, .70)

binom_data <- expand.grid(x = x, N = N, p = p) %>%
  filter(x <= N) %>%
  mutate(density = dbinom(x, prob = p, size = N),
         p = str_c("p = ", p),
         N = str_c("N = ", sprintf("%03d", N)))

ggplot(binom_data, aes(x = x, y = density, fill = p)) +
  geom_col() +
  labs(title = "Binomial Distribution with Different Parameters",
       x = "x",
       y = "Density") +
  theme_cowplot() +
  facet_wrap(p ~ N, scales = "free") +
  guides(fill = "none")
```
exponential
constrained to be zero or positive.
distance and duration, or displacement from some point of reference.
If the probability of an event is constant in time or across space, then the distribution of events tends towards exponential.
maximum entropy among all non-negative continuous distributions with the same average displacement
described by a single parameter, the rate of events \(\lambda\), or the average displacement \(\lambda^{-1}\)
common to survival and event history analysis
```r
# Create data for different rate parameters
x <- seq(0, 5, length.out = 1000)
rates <- c(0.5, 1, 2)

exp_data <- expand.grid(x = x, rate = rates) %>%
  mutate(density = dexp(x, rate),
         rate = factor(rate, labels = paste("λ =", rates)))

ggplot(exp_data, aes(x = x, y = density, color = rate)) +
  geom_line(linewidth = 1) +
  labs(title = "Exponential Distribution with Different Rate Parameters",
       x = "x",
       y = "Density",
       color = "Rate") +
  theme_cowplot() +
  scale_color_manual(values = c("#1c5253", "#5e8485", "#0f393a"))
```
gamma
constrained to be zero or positive.
distance and duration, or displacement from some point of reference
can have a peak above zero (exponential cannot)
maximum entropy among all distributions with the same mean and same average logarithm
shape is described by two parameters (\(a\) and \(b\))
```r
# Create data for different shape and rate parameters
x <- seq(0, 10, length.out = 1000)
shapes <- list(c(1, 1), c(2, 2), c(5, 1))
names <- c("a=1, b=1", "a=2, b=2", "a=5, b=1")

gamma_data <- map_df(seq_along(shapes), function(i) {
  data.frame(x = x,
             density = dgamma(x, shape = shapes[[i]][1], rate = shapes[[i]][2]),
             params = names[i])
})

ggplot(gamma_data, aes(x = x, y = density, color = params)) +
  geom_line(linewidth = 1) +
  labs(title = "Gamma Distribution with Different Shape and Rate Parameters",
       x = "x",
       y = "Density",
       color = "Parameters") +
  theme_cowplot() +
  scale_color_manual(values = c("#1c5253", "#5e8485", "#0f393a"))
```
poisson
count-distributed (like binomial)
actually a special case of the binomial where \(n\) is large and \(p\) is small
used for counts that never get close to any theoretical maximum
As a special case of the binomial, it has maximum entropy under exactly the same constraints
described by a single parameter, the rate of events \(\lambda\)
```r
# Create data for different lambda parameters
x <- 0:15
lambdas <- c(1, 3, 7)

pois_data <- expand.grid(x = x, lambda = lambdas) %>%
  mutate(probability = dpois(x, lambda),
         lambda = factor(lambda, labels = paste("λ =", lambdas)))

ggplot(pois_data, aes(x = x, y = probability, fill = lambda)) +
  geom_col(position = "dodge", alpha = 0.7) +
  labs(title = "Poisson Distribution with Different Rate Parameters",
       x = "Count",
       y = "Probability",
       fill = "Lambda") +
  theme_cowplot() +
  scale_fill_manual(values = c("#1c5253", "#e07a5f", "#f2cc8f"))
```
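To check the binomial-to-Poisson limit numerically (a quick base-R sketch; the particular n and \(\lambda\) here are arbitrary choices): hold \(np = \lambda\) fixed while \(n\) grows and \(p\) shrinks, and the two pmfs become nearly indistinguishable.

```r
# Binomial with large n and small p, holding np = lambda = 3 fixed
n <- 10000
lambda <- 3
x <- 0:15

# Largest absolute gap between the two pmfs over x = 0..15
max(abs(dbinom(x, size = n, prob = lambda / n) - dpois(x, lambda)))
# vanishingly small, and it shrinks further as n grows
```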
motivating dataset
From McElreath’s lecture:
```r
# generative model, basic mediator scenario
set.seed(0319)
N <- 1000  # number of applicants

# even gender distribution
G <- sample(1:2, size = N, replace = TRUE)

# gender 1 tends to apply to department 1, 2 to 2
D <- rbinom(n = N, size = 1, prob = ifelse(G == 1, 0.3, 0.8)) + 1

# matrix of acceptance rates
accept_rate <- matrix(c(0.5, 0.2, 0.1, 0.3), nrow = 2)

# simulate acceptance
A <- rbinom(n = N, size = 1, accept_rate[D, G])

dat <- data.frame(D = as.character(D), G = as.character(G), A)
```
We’re going to fit two models here, one estimating the total effect of gender on acceptance, and one estimating the direct effect stratifying on department.
```r
m1 <- brm(
  data = dat,
  family = bernoulli,
  A ~ 0 + G,
  prior = c(prior(normal(0, 1), class = b)),
  iter = 5000, warmup = 1000, chains = 4, seed = 3,
  file = here("files/models/51.1")
)
```
m1
Family: bernoulli
Links: mu = logit
Formula: A ~ 0 + G
Data: dat (Number of observations: 1000)
Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
total post-warmup draws = 16000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
G1 -1.64 0.13 -1.90 -1.40 1.00 12575 9503
G2 -1.04 0.10 -1.23 -0.85 1.00 14831 10819
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
```r
m2 <- brm(
  data = dat,
  family = bernoulli,
  A ~ G * D,
  prior = c(prior(normal(0, 1), class = Intercept),
            prior(normal(0, 1), class = b)),
  iter = 5000, warmup = 1000, chains = 4, seed = 3,
  file = here("files/models/51.2")
)
```
m2
Family: bernoulli
Links: mu = logit
Formula: A ~ G * D
Data: dat (Number of observations: 1000)
Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
total post-warmup draws = 16000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.07 0.16 -2.39 -1.75 1.00 9026 10062
G2 0.33 0.29 -0.25 0.88 1.00 7352 8462
D2 1.15 0.24 0.68 1.63 1.00 7984 8930
G2:D2 -0.33 0.35 -1.00 0.35 1.00 6300 7196
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
A reminder that we’re running a LOGISTIC REGRESSION, which is a special case of a BINOMIAL REGRESSION. We have used a link function – the logit link function – to convert our outcome (probability) into something that is not bounded and can be estimated using linear equations. But now we need to go back from the logit to the probability.
\[
q = \text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)
\]
\[
p = \text{logit}^{-1}(q) = \frac{\text{exp}(q)}{1 + \text{exp}(q)}
\]
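In R, these two functions are built in as qlogis() (the logit) and plogis() (its inverse). A quick sketch, plugging in the G1 estimate from m1 above:

```r
qlogis(0.5)    # logit: probability -> log-odds; logit(0.5) = 0
plogis(0)      # inverse logit: log-odds -> probability; gives 0.5

# m1's G1 coefficient, back on the probability scale
plogis(-1.64)  # roughly 0.16: about a 16% acceptance rate for group 1
```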
It doesn’t matter how you fit the model: use add_epred_draws() to get estimated values for each combination of predictors. This also gives you the probability scale, not the logit. Yay!
Using the data UCBadmit from the rethinking package, fit
a model estimating the total effect of gender on acceptance, and
a model estimating the direct effect of gender stratifying by department.
Hint: these data are given in the aggregated form, not the long form. You’ll need to use a binomial distribution, not a bernoulli. And you’ll need to specify the number of trials. So your formula will look like this:
admit | trials(applications) ~ [predictors go here]
Explore your options with get_variables() and use the spread_draws() function to sample from the posterior distribution of one of your models. What does this return?
# A tibble: 192,000 × 7
dept gender applications admit reject .draw .epred
<fct> <fct> <int> <int> <int> <int> <dbl>
1 A male 825 512 313 1 378.
2 A male 825 512 313 2 358.
3 A male 825 512 313 3 356.
4 A male 825 512 313 4 354.
5 A male 825 512 313 5 371.
6 A male 825 512 313 6 362.
7 A male 825 512 313 7 377.
8 A male 825 512 313 8 363.
9 A male 825 512 313 9 352.
10 A male 825 512 313 10 384.
# ℹ 191,990 more rows
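Note that with a binomial model using trials(), .epred is on the count scale (the expected number of admissions out of the applications), so dividing by the number of applications recovers the acceptance probability. A rough check using the values from the first row above:

```r
# expected admits / applications = estimated acceptance probability
378 / 825  # about 0.46 for male applicants to department A in this draw
```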
exercise
Recreate this figure (from the model without department), which shows that female applicants are accepted at a rate .09 to .18 lower than male applicants.